Abstract

Problem Statement: Learning-teaching activities bring along the need to 
determine whether they achieve their goals. Thus, multiple choice tests 
addressing the same set of questions to all are frequently used. However, 
this traditional assessment and evaluation form contrasts with modern 
education, where individual learning characteristics are featured. Hence, 
the use of Computerized Adaptive Testing (CAT) systems, which set the 
difficulty level in accordance with the ability levels of individuals, is 
spreading. However, these systems are not prevalent in Turkey. Therefore,
it is important to develop and assess a CAT system that can be integrated into
the Turkish curriculum.

Purpose of Study: The purpose of this study is to develop a CAT system 
that can be used in the sub-levels of Turkish curricula (high school) and 
assess it in terms of reliability. 

Methods: In this study, a CAT system aimed at the sub-topics
(permutation, combination, binomial expansion, and probability) of the
unit of probability covered in 11th grade mathematics was developed.
The estimation precision of the developed CAT system, from lower to higher
ability levels, was assessed. To this end, an exam was administered in a
high school located in Trabzon in the fall semester of the 2011-2012
academic year. Eighty-four 11th graders participated. Using the data derived
from the exam, reliability coefficient values were calculated for each
learner. System records were used for data collection.

Results: Reliability coefficient values were calculated for each student in
5 different exams: permutation, combination, binomial expansion,
probability, and end-of-unit. The findings indicated that the developed
CAT system produced tests with high reliability for all subjects. The average
reliability coefficient values for the five exams were 0.93, 0.93, 0.88,
0.93, and 0.91, respectively.

Discussion and Conclusion: By presenting questions appropriate to the
ability level of each learner, from lower to higher levels, CATs increase
the precision and reliability of measurement. The developed CAT system
likewise presents questions appropriate to the ability level of the learner,
which increases measurement precision. In addition, it was observed that
the system makes precise measurements across ability levels ranging from
lower to higher. This study of UZWEBMAT-CAT shows that CAT systems can
safely be used within the Turkish curriculum.

Keywords: Computerized adaptive testing, individual assessment, 
individual differences, ability level estimation, adaptive testing. 


Learning-teaching activities bring along the need to determine whether they 
achieve their goals. This need is fulfilled through assessment and evaluation 
activities (Baki, 2008). Therefore, assessment and evaluation have become an
inseparable part of the educational process. Various methods are used for assessment
and evaluation in education. One of these methods is the multiple-choice test. These
tests are particularly useful for large-scale exams. Assessment is needed in
web-based education just as it is in face-to-face education
(Semerci & Bektaş, 2005). The advantages brought by internet technology
make it possible to simultaneously test many students in a computer environment. 
However, the traditional assessment and evaluation form contrasts with modern 
education, where individual learning characteristics are featured. This situation has 
accelerated the transformation of traditional tests into adaptive tests, which are more 
suitable for modern education. 

Computerized adaptive testing 

Computerized adaptive testing refers to administering different questions to
each participant based on their ability level (Liu & Chen, 2012). In computerized
adaptive testing, the difficulty of the test varies dynamically with the ability level
of the participant. If the participant answers an item correctly, a more difficult
item is selected next. If the participant answers incorrectly, an easier item is
selected next (Weiss, 1985). While selecting the items, the ones
providing the most information about the ability level of the participant are selected 
(Bejar, Weiss & Gialluca, 1977; Georgiadou, Triantafillou & Economides, 2006; 
Rudner, 2002; Weiss, 1982). For this reason, question items are separately determined 
for each participant in adaptive tests. Thus, more efficient and accurate measurement 
is performed in adaptive tests in comparison to traditional tests (Cheng, Lin & 
Huang, 2009; Triantafillou, Georgiadou & Economides, 2008; Weiss, 1985). 
Adaptive tests provide a more efficient measurement with far fewer items
than conventional tests (Cheng, Lin & Huang, 2009; Koong & Wu, 2010;
Kreitzberg, Stocking & Swanson, 1978; Lunz, Bergstrom & Gershon, 1994; Weiss,
2004). In addition to this efficiency, measurement precision substantially increases as 
the questions providing most information about the ability level of a participant are 
asked. Moreover, the standard error rate of measurement falls thanks to the increase 
in efficiency ensured in the measurement. A decrease in standard error is an 
indicator that the ability levels of participants are measured more precisely (Bulut & 
Kan, 2012; Huang, Lin & Cheng, 2009; Liu & Chen, 2012; Vispoel, Rocklin & Wang, 
1994; Reckase, 2010; Weiss, 1985). 

Related Works 

CAT systems are used in different fields such as education, health, certification, 
and undergraduate programs. Rios, Millan, Trella, Perez-de-la-Cruz, and Conejo 
(1999) focused on the development of a test generation system module, one of the 
three components of the TREE (TRaining of European Environmental trainers and 
technicians in order to disseminate multinational skills between European countries) 
project. The test generation system was developed as a CAT system independent 
from the TREE system. The results showed that adaptive testing systems could be 
successfully integrated into e-learning systems, and more efficient evaluation could 
be ensured by this means. Eggen and Straetmans (2000) compared paper-and-pencil
tests and a CAT application in an exam used to place students in courses
based on their ability levels. According to the study, the CAT application
required 22% to 44% fewer question items than the paper-and-pencil tests.
Gouli, Kornilakis, Papanikolaou, and
Grigoriadou (2001) focused on the CAT module framework, which they integrated
into the INSPIRE system that had been designed as an adaptive hypermedia teaching
system. That study described the steps in the design and implementation of
this module. No
evaluation was performed concerning that system. Lopez-Cuadrado, Perez, Vadillo,
and Arruabarrena (2002) focused on the development process of the CAT system, 
which they integrated into the adaptive hypermedia system developed for language 
teaching. No evaluation was performed concerning that system. Fliege, Becker, 
Walter, Bjorner, Klapp, and Rose (2005) developed a CAT application for the 
diagnosis of depression symptoms. According to the results of the simulation carried 
out in that study, ability level could be determined at a high reliability level with an 
average of 6 items. Lopez-Cuadrado, Armendariz, and Perez (2006) focused on the
architecture of a CAT system they named GenTAI and its integration into an
e-learning environment. This system was not evaluated in any study. Lilley (2007)
conducted a CAT application and compared it with a classical test. In this study, a
test consisting of 20 questions in total was administered. Ten of these questions were
presented as an adaptive test, and the remaining 10 as a classical test. A Likert-type
scale was used to determine the views of students about these tests. The students
were asked to compare the two test forms in terms of the general difficulty levels of
the questions and tests. The research findings demonstrated that the questions in the
CAT section were better suited to the ability levels of participants and that
participants had quite positive attitudes toward the CAT application. Walter,
Becker, Bjorner, Fliege, Klapp, and
Rose (2007) developed a CAT application for anxiety measurement. The sample of 
the study included 2348 psychiatry and psychosomatic patients, where a total of 13 
scales and 81 items were considered enough to measure anxiety. Simulation activity 
showed that ability level could be measured at high precision through 6 to 8 items. 
Then, the results of the anxiety-CAT application and those of the classical anxiety 
inventory were compared through 102 clinical patients. According to research 
results, anxiety-CAT application can be used for the distinction of patients with a 
mental health problem in a reliable manner, as can the classical inventories. Choi,
Reise, Pilkonis, Hays, and Cella (2010) carried out a study where a fixed-length
test and CAT application were compared within the scope of the development of an 
emotional depression scale. A depressive symptom scale composed of 28 items was 
evaluated within the scope of that study. According to the research results, there was 
quite a high relationship between all short forms and the results of CAT application. 
In that study the CAT application yielded better results in all fields in comparison to 
the short forms. Frey and Seitz (2011) focused on the usability of a multi-dimensional 
CAT system for evaluating the literacy of students within PISA (Programme for 
International Student Assessment). When the exam held through a classical testing 
system was compared with the results of a CAT exam, CAT application was seen to 
have increased the measurement efficiency by 74%. In addition, it was seen that the 
CAT application decreased the number of question items addressed to students from 
56 to 22, without any loss in measurement precision. Oztuna (2008) developed
and assessed a CAT system to determine the level of disability in low back pain and
osteoarthritis of the knee. The results of this study showed that the CAT system can
determine the level of disability with fewer items than a classical test, and that
there is a high degree of agreement between CAT and classical test results.
Kalender (2011) compared CAT ability level estimates with those from the
paper-and-pencil format of the student selection examination science subtest, taking
into account different ability estimation procedures and test termination rules. The
results of the study showed that the CAT system provides more reliable ability level
estimates with fewer items than the paper-and-pencil test format.

Considering the literature, CAT systems are commonly studied abroad.
However, this is quite a new area of study in Turkey. The aim of this study is to
contribute to the use of CAT systems in the Turkish curriculum. To this end, a CAT
system that can be used in secondary education within the Turkish curriculum was
designed and assessed. Accordingly, this study attempts to answer the following
research question in order to evaluate the system, called UZWEBMAT-CAT:

• What is the reliability level of the CAT system developed for the unit of 
probability covered in the 11th grade mathematics curriculum? 

Method


Research Design 

In this study, a CAT system aimed at the sub-topics (permutation, combination,
binomial expansion, and probability) of the unit of probability covered in the 11th
grade mathematics course was developed. The measurement accuracy of the
developed CAT system was evaluated for all ability levels from low to high. To that
end, the exam was administered at a high school located in Trabzon,
Turkey, in the fall semester of the 2011-2012 academic year. The reliability coefficient
of each student's test was calculated using the data obtained from the exam.

Procedure 

The development of the CAT system 

Forming an item pool. One of the most important elements of a CAT system is
its item pool. The item pool consists of items concerning the characteristic
to be measured. For this study, the characteristic to be measured is content
knowledge of permutation, combination, binomial expansion, and probability.
Because no items with known parameters (difficulty, discrimination, guessing)
were available to measure content knowledge of these topics, new items were
created. To determine their parameters, the items must be administered to a sample
group of adequate size, and the sample must have been taught these topics in order
to answer the items. Therefore, 11th and 12th grade students were selected as the
sample group for determining item parameters. The test sheets were
administered at 11 different high schools located in Trabzon, Turkey, during the fall
semester of the 2010-2011 academic year. The number of students answering each
item varied between 605 and 654. The total number of students participating in the
activity was 3146.

Analyzing the question items by the item response theory. Adaptive tests use Item
Response Theory (IRT) instead of classical test theory (Wise & Kingsbury, 2000). The
reason is that, unlike classical test theory, IRT models do not depend on group-level
statistics, which enables individual assessment (Hambleton,
Swaminathan & Rogers, 1991; Rios et al., 1999; Ponsoda, 2000; Weiss, 2004; Marinagi,
Kaburlasos & Tsoukalas, 2007). Item analysis was performed according to IRT after
the created items had been administered at schools. IRT has two assumptions that
must be fulfilled: unidimensionality and local independence. Unidimensionality
means that the items of the test measure only one trait. Local independence means
that, at a given ability level, the answers given to the items of the test are
statistically independent. The existence of a dominant factor underlying test answers
is regarded as evidence of the unidimensionality of the test. Also,
fulfillment of the unidimensionality assumption shows that the local independence
assumption is also fulfilled (Hambleton et al., 1991). All test sheets developed in this
study were subjected to factor analysis to examine whether they were
unidimensional. The results showed that all tests
were unidimensional. In other words, each of the created tests (permutation,
combination, binomial expansion, and probability) measured its own content
knowledge. Thus, the assumption of local independence was also fulfilled.

Investigating model fit. The IRT model, based on which the answers obtained from 
the sub-tests of the unit of probability would be analyzed, was determined through 
model fit statistics. The MULTILOG 7.03 program was used to determine the model 
fit as well as the item parameters. All test sheets were analyzed, and the IRT model to
be employed was determined. The analysis results demonstrated that the
3-Parameter Logistic (3PL) model fit the test data.
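
As an illustration of the selected model, the 3PL response function can be sketched in Python as follows. The function name `p_3pl` is hypothetical, and omitting the optional scaling constant D = 1.7 is an assumption of this sketch; the source does not state which metric the MULTILOG estimates were reported in.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: ability level, a: discrimination, b: difficulty, c: guessing.
    Assumption: plain logistic metric, without the D = 1.7 scaling constant.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

When the guessing parameter is zero, the probability equals 0.5 at the point where ability equals item difficulty; a positive guessing parameter raises the lower asymptote of the curve to c.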

Examining item parameters. The parameters of each item need to be examined in 
order to decide on the items to be used in an adaptive testing application. Some items 
may not have the "S" shape, which is the ideal form of the item characteristic
curve. The reason may be that these items have low discrimination and
low difficulty values, or a high guessing parameter. In addition, items
with a negative discrimination value are problematic (Onder, 2007). The
items considered problematic are removed from the test. In this sense, the item
characteristic curves and item parameters of the items included in the tests were 
examined. To this end, the MULTILOG 7.03 program was used. The item 
characteristic curves of a sample test randomly chosen among tests are given below. 
Figure 1 shows the item characteristic curves of all items of the Permutation-8 test. 



Figure 1. The item characteristic curves of all items of the permutation-8 test 

As seen in Figure 1, the item characteristic curve of the 6th item is a reversed S.
This is because this item has a negative discrimination value. Thus, the 6th item was
found to be problematic. After the item characteristic curves had been inspected,
the item parameters were examined to better determine whether the items were
problematic. Table 1 presents the item parameters of the Permutation-8 test based on
the 3PL model.

Table 1

The Item Parameters of the Permutation-8 Test According to 3PL

Item   Discrimination (a)   Difficulty (b)   Guessing (c)
1       1.187               -0.125           0
2       0.459                0.966           0
3       0.475                0.94            0
4       0.193               -1.159           0
5       1.039                0.254           0
6      -2.36                -1.76            0
7       0.285                0.235           0.379
8       0.852               -0.025           0
9       0.94                 0.296           0
10      0.438                1.029           0
11      1.969               -0.282           0.362
12      2.344               -0.726           0
13      1.146               -0.106           0
14      1.228                0.165           0
15      2.125                1.3             0.328
16      1.274               -0.023           0
17      1.026                0.043           0
18      0.952               -0.039           0
19      0.659                1.69            0
20      1.511                0.172           0
21      0.765                0.906           0
22      0.852                0.503           0
23      1.802                0.645           0
24      0.487               -1.585           0
25      2.116               -0.689           0
26      1.108                0.022           0
27      0.821                0.203           0


The discrimination parameter of the 6th item was found to be lower than zero.
Therefore, the 6th item was removed from the test. The remaining 26 items were
included in the item pool. The same analyses were performed for all test sheets
pertaining to the unit of probability. The analysis of test data showed that a total of
86 items across all test sheets were problematic. These problematic items were
removed from the tests, and a total of 752 questions were included in the item pool.
The distribution of questions in the final item pool by subtopic is as follows:
permutation: 239; combination: 159; binomial expansion: 102; and probability: 252.
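
The screening step for the Permutation-8 test can be sketched in Python as below. This is a minimal illustration, not the authors' procedure: the function name `screen_items` is hypothetical, and the only rule applied is the one the text states explicitly for this test (removing items with negative discrimination). The text gives no numeric thresholds for low discrimination or high guessing, so none are assumed here.

```python
# (item number, a, b, c) rows transcribed from Table 1
PERMUTATION_8 = [
    (1, 1.187, -0.125, 0.0), (2, 0.459, 0.966, 0.0), (3, 0.475, 0.94, 0.0),
    (4, 0.193, -1.159, 0.0), (5, 1.039, 0.254, 0.0), (6, -2.36, -1.76, 0.0),
    (7, 0.285, 0.235, 0.379), (8, 0.852, -0.025, 0.0), (9, 0.94, 0.296, 0.0),
    (10, 0.438, 1.029, 0.0), (11, 1.969, -0.282, 0.362), (12, 2.344, -0.726, 0.0),
    (13, 1.146, -0.106, 0.0), (14, 1.228, 0.165, 0.0), (15, 2.125, 1.3, 0.328),
    (16, 1.274, -0.023, 0.0), (17, 1.026, 0.043, 0.0), (18, 0.952, -0.039, 0.0),
    (19, 0.659, 1.69, 0.0), (20, 1.511, 0.172, 0.0), (21, 0.765, 0.906, 0.0),
    (22, 0.852, 0.503, 0.0), (23, 1.802, 0.645, 0.0), (24, 0.487, -1.585, 0.0),
    (25, 2.116, -0.689, 0.0), (26, 1.108, 0.022, 0.0), (27, 0.821, 0.203, 0.0),
]


def screen_items(items):
    """Split items into kept and removed lists.

    Assumption: an item is removed only when its discrimination (a) is
    not positive, which is the criterion the text applies to item 6.
    """
    kept = [item for item in items if item[1] > 0]
    removed = [item for item in items if item[1] <= 0]
    return kept, removed


kept, removed = screen_items(PERMUTATION_8)
removed_numbers = [item[0] for item in removed]  # item numbers flagged for removal
```

Applied to the Table 1 values, this rule flags only item 6 and leaves 26 items, which agrees with the count reported in the text.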

Designing and encoding the test system in the computer environment. This section
describes the steps in designing and coding the CAT system
in the computer environment. The formation of the item pool was followed by the
determination of the ability level estimation procedure, the item selection procedure,
and the termination rule. In this study, Maximum Likelihood Estimation (MLE) was
used as the ability level estimation procedure, and Maximum Information Selection
(MIS) was used as the item selection procedure. A fixed-length termination rule was
employed for ending the testing session. The numbers
of questions were as follows: permutation test: 15; combination test: 15; binomial
expansion test: 15; probability test: 15; and end-of-unit test: 20.
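
The three design choices named above (MLE, MIS, and a fixed test length) can be illustrated with the following Python sketch. It is not the UZWEBMAT-CAT implementation: all function names are hypothetical, the ability estimate is found by a simple grid search over an assumed range of -3 to 3, and the 3PL item information formula is the standard one without the D = 1.7 scaling constant.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (no D = 1.7 scaling, an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b, c):
    """Standard 3PL item information: a^2 * (Q / P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next_item(theta, pool, administered):
    """Maximum information selection: index of the unused item that is most
    informative at the current ability estimate."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))


def estimate_theta(items, responses, lo=-3.0, hi=3.0, steps=600):
    """Grid-search MLE on [lo, hi]. Returns None unless the responses contain
    at least one correct (1) and one incorrect (0) answer."""
    if len(set(responses)) < 2:
        return None
    best_theta, best_ll = lo, -math.inf
    for k in range(steps + 1):
        theta = lo + (hi - lo) * k / steps
        ll = 0.0
        for (a, b, c), u in zip(items, responses):
            p = p_3pl(theta, a, b, c)
            ll += math.log(p) if u == 1 else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = theta, ll
    return best_theta


def standard_error(theta, items):
    """Standard error as the inverse square root of the test information."""
    return 1.0 / math.sqrt(sum(item_information(theta, *it) for it in items))
```

A fixed-length session would repeat `select_next_item` and `estimate_theta` until the preset number of items (15 or 20 in this study) has been administered, and then report `standard_error` at the final estimate.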

Integrating the CAT system into UZWEBMAT. The developed CAT system was
integrated into UZWEBMAT. UZWEBMAT is an individualized, adaptive, and
intelligent web-based mathematics learning environment based on
Visual-Auditory-Kinesthetic (VAK) learning styles (Ozyurt, Ozyurt & Baki, 2013).
UZWEBMAT covers the subjects of permutation, combination, binomial expansion, and
probability, which are part of the unit of probability within the secondary
education mathematics curriculum. The UZWEBMAT system is a dynamic learning
environment that can adapt itself to the VAK learning styles and learning
characteristics of students. The prepared CAT system was integrated into this
environment as an assessment and measurement module. This module is named
UZWEBMAT-CAT.

Sample 

The developed testing system was used in an exam in a school
environment, and the obtained data were evaluated. The exam was conducted at a
high school located in Trabzon, Turkey, in the fall semester of the 2011-2012
academic year. A total of 84 11th grade students took part in the exam. The names of
the students were kept confidential and encoded as Std1, Std2, ..., Std84 in
the present study. All actual information about the students was withheld in
all figures.

Research Instruments 

System records were used for data collection. For each student in each exam, the
system records contain the ability level, the score, the standard error value of the
exam session, and learning style information.

Data Analysis 

The reliability of IRT-based CATs is measured via the standard error. As the
standard error decreases, the reliability of the test increases. The standard error is
inversely proportional to the square root of the amount of information
provided by the test (Hambleton et al., 1991). To make them easier to interpret, the
standard errors in IRT were converted to a classical reliability coefficient based on
the formula suggested by Thissen (2000). Classical reliability
coefficient ranges are as follows: 0-0.4: unreliable; 0.4-0.6: low reliability; 0.6-0.8: fair
reliability; 0.8-1.0: high reliability (Ozdamar, 2004). The relationship between
the standard error and the reliability coefficient is as follows:

$$ r = 1 - SE(\theta)^2 \qquad (1) $$

$r$: Reliability coefficient

$SE(\theta)$: The standard error of the test at ability level $\theta$
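
The conversion can be sketched in Python as follows, assuming Equation 1 has the squared-standard-error form r = 1 - SE(θ)². That reading is consistent with the values reported in Table 2, where a standard error of 0.21 corresponds to a reliability of 0.96 and a standard error of 0.47 to 0.78. The function names are hypothetical, and treating each band as including its lower bound is an assumption, since the stated ranges share their endpoints.

```python
def reliability_from_se(se):
    """Classical reliability coefficient from an IRT standard error,
    assuming r = 1 - SE^2."""
    return 1.0 - se ** 2


def reliability_band(r):
    """Label a coefficient using the ranges quoted from Ozdamar (2004).
    Assumption: each band includes its lower bound."""
    if r < 0.4:
        return "unreliable"
    if r < 0.6:
        return "low reliability"
    if r < 0.8:
        return "fair reliability"
    return "high reliability"


low_error = round(reliability_from_se(0.21), 2)   # 0.96
high_error = round(reliability_from_se(0.47), 2)  # 0.78
```

Under this reading, a standard error below roughly 0.45 is needed for a test to reach the high reliability band, since 1 - 0.45² is about 0.80.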


Results 

The functioning and reliability of UZWEBMAT-CAT 

This section presents how UZWEBMAT-CAT functions and the findings on
test reliability. To make the functioning of the testing system concrete, the way the
system approaches the real ability level based on the given responses, and the
accompanying change in the reliability coefficient, are presented. Exam data of three
learners with different ability levels were used to show how UZWEBMAT-CAT
operates from lower to higher levels. Figure 2 presents the data of three different
students in the end-of-unit test. Std84 represented the higher level, while Std11 and
Std80 represented the intermediate and lower levels, respectively.

Figure 2. The comparison of the progress of Std84, Std11, and Std80 in the
end-of-unit test


As seen in Figure 2, the first ability estimations of Std84, Std11, and Std80 were made
at the 7th, 4th, and 2nd questions, respectively. This is because the MLE procedure
needs at least one correct and one incorrect response to estimate the level (Weiss,
1982). The first ability level estimate of Std84 was 3. According to the responses of
Std84, UZWEBMAT-CAT either increased or decreased the difficulty level of the
questions and re-estimated the ability level at each step. Considering the overall
performance of Std84 in this exam and their ability level at the end of the exam
(1.981), it is possible to say that they had a high level of ability. Considering the
overall performance of Std11 in the end-of-unit test and their final ability level
(1.023), it is possible to say that they had a medium level of ability. Finally,
considering the overall performance of Std80 in the end-of-unit test and their final
ability level (-0.257), it is possible to say that they had a low level of ability. Examining
Figure 2 in detail, the extent of change in the ability level estimates of these three
learners is remarkable. As the test progresses, the difference
between successive ability level estimates of each learner decreases gradually. This
means that UZWEBMAT-CAT moves step by step closer to the real ability level of
the learner being estimated. Thus, the range in which the real ability level is
estimated shrinks. The estimate obtained at the end of the test is very close to the
learner's real ability level. This can be inferred from the standard errors and the
reliability coefficient values calculated from them. The reliability of the system was
investigated using the system records obtained from the exam conducted through
UZWEBMAT-CAT, as detailed below.
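
The requirement that MLE needs at least one correct and one incorrect response can be illustrated with a small Python sketch. The item parameters below are hypothetical, and a two-parameter logistic form (no guessing) is assumed for brevity. With all responses correct, the log-likelihood keeps increasing as ability increases, so no finite maximum exists; with a mixed pattern, the maximum lies inside the ability range.

```python
import math


def log_likelihood(theta, items, responses):
    """Log-likelihood of a response pattern under a 2PL model
    (guessing omitted for brevity, an assumption of this sketch)."""
    total = 0.0
    for (a, b), u in zip(items, responses):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


# Hypothetical (a, b) pairs, not taken from the study's item pool
items = [(1.0, -0.5), (1.2, 0.0), (0.9, 0.5)]
grid = [-3.0 + 0.5 * k for k in range(13)]  # ability values from -3 to 3

all_correct = [log_likelihood(t, items, [1, 1, 1]) for t in grid]
mixed = [log_likelihood(t, items, [1, 1, 0]) for t in grid]

# Index of the grid point with the highest log-likelihood for the mixed pattern
best_mixed_index = max(range(len(grid)), key=lambda k: mixed[k])
```

This is one way to read why the first estimate for Std84 appeared only at the 7th question and took the value 3: a student who answers every early item correctly has no finite estimate until an incorrect response occurs, and an estimate of 3 would be consistent with an upper bound on the ability scale, although the source does not state this explicitly.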

For permutation, the reliability coefficient of the tests taken by 3 students is
between 0.78 and 0.79; that of the tests taken by 5 students is between 0.83 and 0.89;
and that of the tests taken by the remaining 76 students is between 0.90 and 0.96.
Based on these values, the tests taken by 3 students for permutation are fairly
reliable, while the tests taken by the other 81 students have very high reliability.
For combination, the reliability coefficient of the tests taken by 7 students is between
0.83 and 0.89, and that of the tests taken by the remaining 77 students is between 0.90
and 0.97. Based on these values, the tests taken by all students for combination have
very high reliability. For binomial expansion, the reliability coefficient of the tests
taken by 3 students is between 0.76 and 0.78; that of the tests taken by 49 students is
between 0.81 and 0.89; and that of the tests taken by the other 32 students is between
0.90 and 0.95. Based on these values, the tests taken by 3 students for binomial
expansion are fairly reliable, and the tests taken by the remaining 81 students are
very reliable. For probability, the reliability coefficient of the test taken by 1 student
is 0.78; that of the tests taken by 16 students is between 0.80 and 0.89; and that of the
tests taken by the remaining 67 students is between 0.90 and 0.97. Based on these
values, the test taken by 1 student for probability is fairly reliable, and the tests taken
by the remaining 83 students are very reliable. Finally, for end-of-unit, the reliability
coefficient of the tests taken by 2 students is between 0.76 and 0.78; that of the tests
taken by 15 students is between 0.80 and 0.89; and that of the tests taken by the
remaining 67 students is between 0.90 and 0.97. Based on these values, the tests
taken by 2 students for the end-of-unit test are fairly reliable, and the tests taken by
the remaining 82 students are very reliable. Table 2 shows the minimum, maximum,
and average standard error and traditional reliability coefficient values for each test
between ability levels -3 and +3. In Table 2, the cells for ability levels between -3 and
-1 carry no values because no student was estimated to have an ability level in that
range.

Table 2

Standard Error Amounts and Reliability Coefficients in All Tests

                                                                    Ability Level
Test                 Measure                              Stat     (-3,-2)  (-2,-1)  (-1,0)  (0,1)   (1,2)   (2,3)
Permutation          Standard error amount                Min      -        -        0.21    0.21    0.21    0.24
                                                          Max      -        -        0.25    0.25    0.47    0.45
                                                          Average  -        -        0.238   0.237   0.235   0.32
                     Traditional reliability coefficient  Min      -        -        0.94    0.93    0.78    0.79
                                                          Max      -        -        0.96    0.96    0.96    0.94
                                                          Average  -        -        0.943   0.943   0.944   0.897
Combination          Standard error amount                Min      -        -        0.17    0.18    0.18    0.2
                                                          Max      -        -        0.22    0.24    0.27    0.4
                                                          Average  -        -        0.175   0.21    0.24    0.29
                     Traditional reliability coefficient  Min      -        -        0.95    0.94    0.83    0.84
                                                          Max      -        -        0.97    0.97    0.97    0.96
                                                          Average  -        -        0.967   0.955   0.942   0.915
Binomial expansion   Standard error amount                Min      -        -        0.23    0.25    0.24    0.25
                                                          Max      -        -        0.41    0.41    0.41    0.4
                                                          Average  -        -        0.264   0.284   0.34    0.4
                     Traditional reliability coefficient  Min      -        -        0.83    0.84    0.83    0.76
                                                          Max      -        -        0.95    0.94    0.94    0.94
                                                          Average  -        -        0.93    0.919   0.884   0.84
Probability          Standard error amount                Min      -        -        0.17    0.17    0.17    0.19
                                                          Max      -        -        0.24    0.24    0.41    0.47
                                                          Average  -        -        0.187   0.191   0.253   0.359
                     Traditional reliability coefficient  Min      -        -        0.94    0.94    0.83    0.78
                                                          Max      -        -        0.97    0.97    0.97    0.96
                                                          Average  -        -        0.964   0.963   0.935   0.87
End of unit          Standard error amount                Min      -        -        0.18    0.19    0.16    0.25
                                                          Max      -        -        0.26    0.42    0.47    0.49
                                                          Average  -        -        0.208   0.252   0.286   0.35
                     Traditional reliability coefficient  Min      -        -        0.93    0.83    0.78    0.76
                                                          Max      -        -        0.97    0.96    0.97    0.94
                                                          Average  -        -        0.956   0.936   0.917   0.877

Discussion and Conclusion

The research findings show that the developed UZWEBMAT-CAT application
generates highly reliable results. The high-precision, high-reliability
measurement through adaptive testing reported in the literature was
achieved in the UZWEBMAT-CAT application (Eggen & Straetmans, 2000; Choi et al.,
2010; Frey & Seitz, 2011). The results of the UZWEBMAT-CAT application are
consistent with the results of these studies.

Weiss (1985) argues that CAT applications provide high measurement precision
because they provide each individual with questions suitable for their own level.
The UZWEBMAT-CAT application also ensured high measurement efficiency
because it provided every individual with questions suitable for their own ability
level. Another feature of CAT applications emphasized in the literature is that they
perform very precise measurement at all ability levels, from low to high, because
they provide every individual with tests suitable for their own ability level (Eggen
& Straetmans, 2000; Frey & Seitz, 2011; Hambleton, 1990; Weiss, 1982; Wise &
Kingsbury, 2000). The data obtained from the UZWEBMAT-CAT application support
this conclusion. According to these data, the ability level and reliability coefficient
ranges of each test are as follows.
Ability levels: between -0.618 and 2.437 for the permutation test; between -0.427 and
2.481 for the combination test; between -0.784 and 3 for the binomial expansion test;
between -0.71 and 2.436 for the probability test; and between -0.401 and 2.312 for the
end-of-unit test. Reliability coefficients: between 0.78 and 0.96 for the permutation
test; between 0.83 and 0.97 for the combination test; between 0.76 and 0.95 for the
binomial expansion test; between 0.78 and 0.97 for the probability test; and between
0.76 and 0.97 for the end-of-unit test. Based on all of these values, it can be said that
the subject tests perform high-precision measurement for all ability levels. This
study of UZWEBMAT-CAT shows that CAT systems can safely and efficiently
be employed within the Turkish curriculum.

Estimating Students' Knowledge Levels in the Probability Unit via Computerized
Adaptive Testing


Citation:

Ozyurt, H. & Ozyurt, O. (2015). Ability Level Estimation of Students on Probability
Unit via Computerized Adaptive Testing. Eurasian Journal of Educational
Research, 58, 27-44. http://dx.doi.org/10.14689/ejer.2015.58.7


Ozet 

Problem Statement: Learning-teaching activities bring with them the need to determine 
whether they achieve their goals. This need is met through measurement and 
evaluation, which have therefore become an integral part of the educational process. 
Various methods are used for measurement and evaluation in education, one of which 
is the multiple-choice test. These tests are especially convenient for large-scale 
examinations. Internet-based education needs measurement just as face-to-face 
education does. With the advantages of Internet technology, a large number of 
students can take an exam on computers at the same time. However, this traditional 
form of assessment conflicts with the modern understanding of education, in which 
individual learning characteristics come to the fore. This has accelerated the shift 
from traditional tests to adaptive tests, which suit the new understanding better. 
Hence, Computerized Adaptive Testing (CAT) systems, which set question difficulty 
according to each individual's ability level, are becoming increasingly widespread. 
In a CAT, the difficulty of the test changes dynamically with the examinee's ability 
level. If the examinee answers a question correctly, the next question is chosen to 
be harder; if the examinee answers incorrectly, the next question is chosen to be 
easier. At each step, the items selected are those that provide the most information 
about the examinee's ability level. Adaptive tests thus measure more efficiently and 
precisely than traditional tests. CAT systems are applications based on Item Response 
Theory (IRT). Although quite common abroad, CAT systems are not yet widespread in 
Turkey. It is therefore important to develop and evaluate a CAT system that can be 
used within the Turkish education system. 
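
The selection rule described above (choose the item that is most informative at the 
current ability estimate) can be sketched as follows. This is only an illustration: 
the source does not state which IRT model UZWEBMAT-CAT uses, so a two-parameter 
logistic (2PL) model is assumed here, and the item pool, its identifiers, and the 
function names are hypothetical. 

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, pool, administered):
    """Return the id of the unused item with maximum information at theta."""
    candidates = [item_id for item_id in pool if item_id not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))


# Hypothetical pool: item id -> (discrimination a, difficulty b).
pool = {"q1": (1.0, -1.0), "q2": (1.0, 0.0), "q3": (1.0, 1.0), "q4": (1.0, 2.0)}

first = select_next_item(0.0, pool, administered=set())
# After a correct answer the ability estimate rises, so a harder item is chosen.
second = select_next_item(0.9, pool, administered={first})
print(first, second)
```

With equal discriminations, the most informative item is the one whose difficulty is 
closest to the current estimate, which is why a correct answer leads to a harder next 
question and an incorrect answer to an easier one. 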

Purpose of Study: The main purpose of this study is to develop a CAT system that can 
be used at the lower levels of the Turkish education system (high school level) and 
to evaluate it in terms of reliability. 

Methods: In this study, a CAT system was developed for the sub-topics (permutation, 
combination, binomial expansion, and probability) of the probability unit in 11th 
grade mathematics. Development of the CAT system consisted of three stages: building 
the item pool (writing the items, administering them, and analyzing them according 
to IRT), coding the system, and integrating it into UZWEBMAT, an adaptive and 
intelligent web-based mathematics learning environment. After development, the 
measurement precision of the CAT system was evaluated at all ability levels, from 
low to high. To this end, an exam was administered at a high school in the province 
of Trabzon, Turkey, in the fall semester of the 2011-2012 academic year. A total of 
84 11th graders took part. Using the data from this administration, reliability 
coefficients were calculated for the tests taken by each student. The reliability of 
IRT-based CATs is measured through the standard error: as the standard error 
decreases, the reliability of the test increases. The standard error was converted 
to a classical reliability coefficient for reporting. System records were used as 
the data collection instrument. 
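
The conversion from a standard error to a classical reliability coefficient can be 
illustrated with a short sketch. The source does not give the formula it used; the 
code below assumes the commonly used relation r = 1 - SE^2 / Var(theta), with the 
ability scale standardized to unit variance by default. The function names and the 
example value are hypothetical. 

```python
import math


def reliability_from_se(se, theta_variance=1.0):
    """Convert a standard error of the ability estimate to a reliability value.

    Assumes r = 1 - SE^2 / Var(theta); this formula is an assumption,
    not one confirmed by the source.
    """
    if se < 0 or theta_variance <= 0:
        raise ValueError("se must be >= 0 and theta_variance must be > 0")
    return 1.0 - (se * se) / theta_variance


def se_from_reliability(r, theta_variance=1.0):
    """Inverse conversion: the standard error implied by a reliability value."""
    if r > 1:
        raise ValueError("reliability cannot exceed 1")
    return math.sqrt((1.0 - r) * theta_variance)


# Illustrative value only: a standard error of 0.30 on a unit-variance scale.
print(round(reliability_from_se(0.30), 2))
```

Under this assumption a smaller standard error always yields a higher reliability, 
which matches the relationship described in the text. 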

Findings: The operation of the test system is presented to make concrete how the 
system approaches the true ability level at each step based on the responses given, 
and how the reliability coefficient changes in the process. To show how UZWEBMAT-CAT 
behaves at all ability levels, from low to high, end-of-unit test data from three 
students with different ability levels were used. The students were chosen so that 
they could be characterized as having high, medium, and low ability. The change in 
these three students' ability estimates from the first question to the last is 
notable. As the test progresses, the difference between each student's successive 
ability estimates steadily shrinks. This means that UZWEBMAT-CAT gets a little closer 
at each step to the true ability level it is trying to estimate, so the interval 
within which the true ability level is sought narrows. The estimate obtained at the 
end of the test is very close to each student's true ability level. This can be seen 
by examining the standard error of the test and the reliability coefficient 
calculated from it. Reliability coefficients were calculated for the test taken by 
each student in 5 separate exams: permutation, combination, binomial expansion, 
probability, and end-of-unit. For this purpose, the standard error of each test was 
converted to a classical reliability coefficient. The findings show that the 
developed CAT system produced highly reliable tests for all topics. The average 
reliability coefficients of the tests for each topic were calculated as 0.93, 0.93, 
0.88, 0.93, and 0.91. The developed CAT system was found to produce precise and 
reliable tests at all ability levels, from low to high. 
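
The narrowing of the ability estimate as responses accumulate can be illustrated 
with a minimal sketch. The source does not say which estimator UZWEBMAT-CAT uses; 
the code below assumes a 2PL model and an expected a posteriori (EAP) estimate with 
a standard normal prior, computed on a grid. The response record and all item 
parameters are hypothetical, not data from the study. 

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def eap_estimate(responses, grid_min=-4.0, grid_max=4.0, steps=161):
    """Return (EAP ability estimate, posterior SD) under an N(0, 1) prior.

    responses: list of (a, b, u), where u is 1 for correct and 0 for incorrect.
    """
    width = (grid_max - grid_min) / (steps - 1)
    grid = [grid_min + k * width for k in range(steps)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized standard normal prior
        for a, b, u in responses:
            p = p_correct(theta, a, b)
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


# Hypothetical response record: (discrimination, difficulty, correct?).
record = [(1.5, 0.0, 1), (1.5, 0.8, 1), (1.5, 1.4, 0), (1.5, 1.0, 1), (1.5, 1.2, 0)]

for n in range(len(record) + 1):
    theta_hat, se = eap_estimate(record[:n])
    print(f"items={n} theta={theta_hat:+.3f} se={se:.3f}")
```

In this sketch the posterior standard deviation plays the role of the standard 
error: it starts near 1 with no responses and is smaller once the responses have 
been taken into account. 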

Discussion and Conclusion: One important property of CAT emphasized in the literature 
is that it measures very precisely at all ability levels, from low to high, because 
it gives each individual a test suited to his or her own ability level. The developed 
CAT system was found to have this property: the tests taken by students with 
different ability levels had high reliability. Another point emphasized in the 
literature is that, because CATs ask each individual questions at his or her own 
ability level, the standard error of measurement falls and measurement precision 
therefore increases. The developed CAT system likewise asked each student questions 
at his or her own ability level, and measurement precision increased accordingly. 
Looking at the ranges of the reliability coefficients of the administered tests, a 
small number of values fall in the "quite reliable" range and a large number in the 
"very reliable" range. This study on the evaluation of UZWEBMAT-CAT shows that CAT 
systems can be used reliably in the Turkish education system as well. 

Keywords: Computerized adaptive testing, Individual assessment, Individual 
differences, Ability level estimation, Adaptive testing. 



